Stability (learning theory)

Stability, also known as algorithmic stability, is a notion in computational learning theory of how a machine learning algorithm is perturbed by small changes to its inputs. A stable learning algorithm is one for which the prediction does not change much when the training data is modified slightly. For instance, consider a machine learning algorithm that is being trained to recognize handwritten letters of the alphabet, using 1000 examples of handwritten letters and their labels ("A" to "Z") as a training set. One way to modify this training set is to leave out an example, so that only 999 examples of handwritten letters and their labels are available. A stable learning algorithm would produce a similar classifier with both the 1000-element and 999-element training sets.

Stability can be studied for many types of learning problems, from language learning to inverse problems in physics and engineering, as it is a property of the learning process rather than the type of information being learned. The study of stability gained importance in computational learning theory in the 2000s when it was shown to have a connection with generalization. It was shown that for large classes of learning algorithms, notably empirical risk minimization algorithms, certain types of stability ensure good generalization.


History

A central goal in designing a machine learning system is to guarantee that the learning algorithm will generalize, or perform accurately on new examples after being trained on a finite number of them. In the 1990s, milestones were reached in obtaining generalization bounds for supervised learning algorithms. The technique historically used to prove generalization was to show that an algorithm was consistent, using the uniform convergence properties of empirical quantities to their means. This technique was used to obtain generalization bounds for the large class of empirical risk minimization (ERM) algorithms. An ERM algorithm is one that selects a solution from a hypothesis space H in such a way as to minimize the empirical error on a training set S.

A general result, proved by Vladimir Vapnik for ERM binary classification algorithms, is that for any target function and input distribution, any hypothesis space H with VC-dimension d, and n training examples, the algorithm is consistent and will produce a training error that is at most O\left(\sqrt{\frac{d}{n}}\right) (plus logarithmic factors) from the true error. The result was later extended to almost-ERM algorithms with function classes that do not have unique minimizers.

Vapnik's work, using what became known as VC theory, established a relationship between generalization of a learning algorithm and properties of the hypothesis space H of functions being learned. However, these results could not be applied to algorithms with hypothesis spaces of unbounded VC-dimension. Put another way, these results could not be applied when the information being learned had a complexity that was too large to measure. Some of the simplest machine learning algorithms, for instance for regression, have hypothesis spaces with unbounded VC-dimension. Another example is language learning algorithms that can produce sentences of arbitrary length.

Stability analysis was developed in the 2000s for computational learning theory and is an alternative method for obtaining generalization bounds. The stability of an algorithm is a property of the learning process, rather than a direct property of the hypothesis space H, and it can be assessed in algorithms that have hypothesis spaces with unbounded or undefined VC-dimension, such as nearest neighbor. A stable learning algorithm is one for which the learned function does not change much when the training set is slightly modified, for instance by leaving out an example. A measure of leave-one-out error is used in a Cross Validation Leave One Out (CVloo) algorithm to evaluate a learning algorithm's stability with respect to the loss function. As such, stability analysis is the application of sensitivity analysis to machine learning.


Summary of classic results

* Early 1900s - Stability in learning theory was earliest described in terms of continuity of the learning map L, traced to Andrey Nikolayevich Tikhonov.
* 1979 - Devroye and Wagner observed that the leave-one-out behavior of an algorithm is related to its sensitivity to small changes in the sample. (L. Devroye and Wagner, Distribution-free performance bounds for potential function rules, IEEE Trans. Inf. Theory 25(5) (1979) 601–604.)
* 1999 - Kearns and Ron discovered a connection between finite VC-dimension and stability.
* 2002 - In a landmark paper, Bousquet and Elisseeff proposed the notion of ''uniform hypothesis stability'' of a learning algorithm and showed that it implies low generalization error. Uniform hypothesis stability, however, is a strong condition that does not apply to large classes of algorithms, including ERM algorithms with a hypothesis space of only two functions. (O. Bousquet and A. Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499–526, 2002.)
* 2002 - Kutin and Niyogi extended Bousquet and Elisseeff's results by providing generalization bounds for several weaker forms of stability which they called ''almost-everywhere stability''. Furthermore, they took an initial step in establishing the relationship between stability and consistency in ERM algorithms in the Probably Approximately Correct (PAC) setting.
* 2004 - Poggio et al. proved a general relationship between stability and ERM consistency. They proposed a statistical form of leave-one-out stability which they called ''CVEEEloo stability'', and showed that it is a) sufficient for generalization in bounded loss classes, and b) necessary and sufficient for consistency (and thus generalization) of ERM algorithms for certain loss functions such as the square loss, the absolute value and the binary classification loss.
* 2010 - Shalev Shwartz et al. noticed problems with the original results of Vapnik due to the complex relations between hypothesis space and loss class. They discuss stability notions that capture different loss classes and different types of learning, supervised and unsupervised.
* 2016 - Moritz Hardt et al. proved stability of gradient descent given certain assumptions on the hypothesis and the number of times each instance is used to update the model.
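
The remark that uniform hypothesis stability can fail even for ERM over two functions can be illustrated with a small sketch. The setup is an assumption chosen for illustration and is not taken from the Bousquet and Elisseeff paper: the hypothesis space holds the two constant classifiers 0 and 1, the loss is the 0-1 loss, and ties are broken in favour of 0. The names `erm_two_functions` and `worst_loo_gap` are hypothetical.

```python
from typing import List


def erm_two_functions(labels: List[int]) -> int:
    """ERM over H = {always 0, always 1} with 0-1 loss: pick the majority label.

    Ties are broken in favour of 0 (an arbitrary choice for this sketch).
    """
    ones = sum(labels)
    return 1 if ones > len(labels) - ones else 0


def zero_one_loss(prediction: int, label: int) -> int:
    """V(f, z) = 1 if the prediction is wrong, else 0."""
    return int(prediction != label)


def worst_loo_gap(labels: List[int]) -> int:
    """Largest change in loss, over test labels and left-out indices,
    between the hypothesis learned on S and on S with one example removed."""
    f_full = erm_two_functions(labels)
    worst = 0
    for i in range(len(labels)):
        f_loo = erm_two_functions(labels[:i] + labels[i + 1:])
        for y in (0, 1):
            gap = abs(zero_one_loss(f_full, y) - zero_one_loss(f_loo, y))
            worst = max(worst, gap)
    return worst


# Training sets with a one-vote majority: (m + 1) / 2 ones, (m - 1) / 2 zeros.
for m in (5, 51, 501):
    labels = [1] * ((m + 1) // 2) + [0] * ((m - 1) // 2)
    print(m, worst_loo_gap(labels))
```

For every m in this sketch the gap equals 1, the largest value the 0-1 loss allows: removing a single example from the majority flips the learned function. No bound β that shrinks with m can therefore hold for all training sets.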


Preliminary definitions

We define several terms related to learning algorithms and training sets, so that we can then define stability in multiple ways and present theorems from the field.

A machine learning algorithm, also known as a learning map L, maps a training data set, which is a set of labeled examples (x,y), onto a function f from X to Y, where X and Y are the input and output spaces of the training examples. The functions f are selected from a hypothesis space of functions called H.

The training set from which an algorithm learns is defined as

S = \{z_1 = (x_1, y_1), \ldots, z_m = (x_m, y_m)\}

and is of size m in Z = X \times Y, drawn i.i.d. from an unknown distribution D.

Thus, the learning map L is defined as a mapping from Z^m into H, mapping a training set S onto a function f_S from X to Y. Here, we consider only deterministic algorithms where L is symmetric with respect to S, i.e. it does not depend on the order of the elements in the training set. Furthermore, we assume that all functions are measurable and all sets are countable.

The loss V of a hypothesis f with respect to an example z = (x,y) is then defined as V(f,z) = V(f(x),y).

The empirical error of f is I_S[f] = \frac{1}{m}\sum_{i=1}^{m} V(f,z_i).

The true error of f is I[f] = \mathbb{E}_z V(f,z).

Given a training set S of size m, we will build, for all i = 1,\ldots,m, modified training sets as follows:
* By removing the i-th element: S^{|i} = \{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_m\}
* By replacing the i-th element: S^i = \{z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_m\}
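
The objects defined in this section can be written down directly. The following is a minimal sketch that assumes real-valued inputs and outputs and the square loss; the helper names (`square_loss`, `empirical_error`, `remove_ith`, `replace_ith`) are hypothetical, and indices are 0-based in the code although the text counts from 1.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, float]          # z = (x, y)
Hypothesis = Callable[[float], float]  # f : X -> Y


def square_loss(f: Hypothesis, z: Example) -> float:
    """V(f, z) = (f(x) - y)^2 for z = (x, y)."""
    x, y = z
    return (f(x) - y) ** 2


def empirical_error(f: Hypothesis, S: Sequence[Example]) -> float:
    """I_S[f]: average loss of f over the training set S."""
    return sum(square_loss(f, z) for z in S) / len(S)


def remove_ith(S: Sequence[Example], i: int) -> List[Example]:
    """S^{|i}: S with its i-th element removed."""
    return list(S[:i]) + list(S[i + 1:])


def replace_ith(S: Sequence[Example], i: int, z_new: Example) -> List[Example]:
    """S^i: S with its i-th element replaced by z_new."""
    return list(S[:i]) + [z_new] + list(S[i + 1:])


def identity(x: float) -> float:
    return x


S = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]
print(empirical_error(identity, S))   # (0 + 0 + 4) / 3
print(remove_ith(S, 1))               # [(0.0, 0.0), (2.0, 4.0)]
print(replace_ith(S, 1, (1.0, 3.0)))  # [(0.0, 0.0), (1.0, 3.0), (2.0, 4.0)]
```

The true error I[f] is not computed here because it requires the unknown distribution D; in practice it can only be estimated.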


Definitions of stability


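Hypothesis Stability

An algorithm L has hypothesis stability β with respect to the loss function V if the following holds:

\forall i\in\{1,\ldots,m\}, \mathbb{E}_{S,z}\left[|V(f_S,z)-V(f_{S^{|i}},z)|\right]\leq\beta.


Point-wise Hypothesis Stability

An algorithm L has point-wise hypothesis stability β with respect to the loss function V if the following holds:

\forall i\in\{1,\ldots,m\}, \mathbb{E}_{S}\left[|V(f_S,z_i)-V(f_{S^{|i}},z_i)|\right]\leq\beta.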


Error Stability

An algorithm L has error stability β with respect to the loss function V if the following holds:

\forall S\in Z^m, \forall i\in\{1,\ldots,m\}, \left|\mathbb{E}_z\left[V(f_S,z)\right]-\mathbb{E}_z\left[V(f_{S^{|i}},z)\right]\right|\leq\beta


Uniform Stability

An algorithm L has uniform stability β with respect to the loss function V if the following holds:

\forall S\in Z^m, \forall i\in\{1,\ldots,m\}, \sup_{z\in Z}|V(f_S,z)-V(f_{S^{|i}},z)|\leq\beta

A probabilistic version of uniform stability β is:

\forall S\in Z^m, \forall i\in\{1,\ldots,m\}, \mathbb{P}_S\left\{\sup_{z\in Z}|V(f_S,z)-V(f_{S^{|i}},z)|\leq\beta\right\}\geq1-\delta

An algorithm is said to be stable when the value of \beta decreases as O(\frac{1}{m}).
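
The O(1/m) behaviour can be probed numerically. The sketch below is an illustration under stated assumptions, not a proof: the learner is one-dimensional regularized least squares without an intercept, the data are synthetic, the supremum over z is replaced by a maximum over a coarse grid, and only one training set is drawn per m, whereas the definition quantifies over all S. All function names and the value `lam=0.1` are hypothetical choices.

```python
import random
from typing import List, Tuple

Example = Tuple[float, float]


def fit_ridge(S: List[Example], lam: float) -> float:
    """Slope w minimising (1/m) * sum((w*x - y)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in S)
    sxx = sum(x * x for x, _ in S)
    return sxy / (sxx + lam * len(S))


def square_loss(w: float, z: Example) -> float:
    x, y = z
    return (w * x - y) ** 2


def loo_sensitivity(S: List[Example], lam: float, grid: List[Example]) -> float:
    """max over i and grid points z of |V(f_S, z) - V(f_{S^{|i}}, z)|."""
    w_full = fit_ridge(S, lam)
    worst = 0.0
    for i in range(len(S)):
        w_loo = fit_ridge(S[:i] + S[i + 1:], lam)
        gap = max(abs(square_loss(w_full, z) - square_loss(w_loo, z)) for z in grid)
        worst = max(worst, gap)
    return worst


def sample(m: int, rng: random.Random) -> List[Example]:
    """Draw m examples with x in [-1, 1] and y = 0.5 * x + bounded noise."""
    S = []
    for _ in range(m):
        x = rng.uniform(-1.0, 1.0)
        S.append((x, 0.5 * x + rng.uniform(-0.2, 0.2)))
    return S


rng = random.Random(0)
# Coarse grid over Z = [-1, 1] x [-1, 1], standing in for the supremum over z.
grid = [(i / 5 - 1.0, j / 5 - 1.0) for i in range(11) for j in range(11)]
for m in (50, 100, 200, 400):
    beta_hat = loo_sensitivity(sample(m, rng), lam=0.1, grid=grid)
    print(m, beta_hat, m * beta_hat)
```

If the estimate shrinks roughly like 1/m, the third printed column stays of the same order as m grows. A single sample can only suggest this, since the definition requires the bound for every S.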


Leave-one-out cross-validation (CVloo) Stability

An algorithm L has CVloo stability β with respect to the loss function V if the following holds:

\forall i\in\{1,\ldots,m\}, \mathbb{P}_S\left\{|V(f_S,z_i)-V(f_{S^{|i}},z_i)|\leq\beta_{CV}\right\}\geq1-\delta_{CV}

The definition of CVloo stability is equivalent to point-wise hypothesis stability seen earlier.
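
The probability in this definition can be estimated by Monte Carlo for a toy learner. The sketch below assumes a deliberately simple setting that is not from the source: the learner ignores x and predicts the mean of the training labels, labels are uniform on [0, 1), and the loss is the square loss. Because the data are i.i.d. and the learner is symmetric, checking the single index i = 0 is enough. The names `cvloo_gap` and `cvloo_probability` are hypothetical.

```python
import random
from typing import List


def fit_mean(ys: List[float]) -> float:
    """Learned constant predictor: the mean of the training labels."""
    return sum(ys) / len(ys)


def cvloo_gap(ys: List[float], i: int) -> float:
    """|V(f_S, z_i) - V(f_{S^{|i}}, z_i)| for the square loss."""
    f_full = fit_mean(ys)
    f_loo = fit_mean(ys[:i] + ys[i + 1:])
    return abs((f_full - ys[i]) ** 2 - (f_loo - ys[i]) ** 2)


def cvloo_probability(m: int, beta: float, trials: int, rng: random.Random) -> float:
    """Fraction of random training sets of size m whose gap at i = 0 is <= beta."""
    hits = 0
    for _ in range(trials):
        ys = [rng.random() for _ in range(m)]
        if cvloo_gap(ys, 0) <= beta:
            hits += 1
    return hits / trials


rng = random.Random(0)
for m in (10, 100, 1000):
    print(m, cvloo_probability(m, beta=0.01, trials=500, rng=rng))
```

For this learner the gap has a closed form: f_S - y_i = ((m-1)/m)(f_{S^{|i}} - y_i), so the gap equals (f_{S^{|i}} - y_i)^2 (2m-1)/m^2, which is at most (2m-1)/m^2 for labels in [0, 1). The estimated probability therefore reaches 1 once m is large enough for the chosen β.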


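Expected-leave-one-out error (Eloo_{err}) Stability

An algorithm L has Eloo_{err} stability if for each m there exists a \beta_{EL}^m and a \delta_{EL}^m such that:

\forall i\in\{1,\ldots,m\}, \mathbb{P}_S\left\{\left|I[f_S]-\frac{1}{m}\sum_{i=1}^m V(f_{S^{|i}},z_i)\right|\leq\beta_{EL}^m\right\}\geq1-\delta_{EL}^m,

with \beta_{EL}^m and \delta_{EL}^m going to zero for m\rightarrow\infty.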
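
The quantity inside the probability compares the true error of f_S with its leave-one-out error estimate. A minimal sketch follows, under the assumption of a toy setting in which the true error is known in closed form: labels uniform on [0, 1], a learner that predicts the mean label, and the square loss. The function names are hypothetical.

```python
import random
from typing import List


def loo_error(ys: List[float]) -> float:
    """(1/m) * sum_i V(f_{S^{|i}}, z_i) for the mean predictor and square loss."""
    m, total = len(ys), sum(ys)
    # The mean of S without y_i is (total - y_i) / (m - 1).
    return sum(((total - y) / (m - 1) - y) ** 2 for y in ys) / m


def true_error(f: float) -> float:
    """I[f] = E (f - y)^2 for y uniform on [0, 1], i.e. (f - 1/2)^2 + 1/12."""
    return (f - 0.5) ** 2 + 1.0 / 12.0


def eloo_gap(ys: List[float]) -> float:
    """|I[f_S] - leave-one-out error| for one training set."""
    f_S = sum(ys) / len(ys)
    return abs(true_error(f_S) - loo_error(ys))


rng = random.Random(0)
for m in (10, 100, 1000, 10000):
    ys = [rng.random() for _ in range(m)]
    print(m, eloo_gap(ys))
```

Each printed value is one draw of the random gap; Eloo_{err} stability asks that this gap fall below \beta_{EL}^m with probability at least 1-\delta_{EL}^m, with both tending to zero as m grows.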


Classic theorems

From Bousquet and Elisseeff (02): For symmetric learning algorithms with bounded loss, if the algorithm has Uniform Stability with the probabilistic definition above, then the algorithm generalizes. Uniform Stability is a strong condition which is not met by all algorithms but is, surprisingly, met by the large and important class of Regularization algorithms. The generalization bound is given in the article.

From Mukherjee et al. (06):
* For symmetric learning algorithms with bounded loss, if the algorithm has ''both'' Leave-one-out cross-validation (CVloo) Stability and Expected-leave-one-out error (Eloo_{err}) Stability as defined above, then the algorithm generalizes.
* Neither condition alone is sufficient for generalization. However, both together ensure generalization (while the converse is not true).
* For ERM algorithms specifically (say for the square loss), Leave-one-out cross-validation (CVloo) Stability is both necessary and sufficient for consistency and generalization.

This is an important result for the foundations of learning theory, because it shows that two previously unrelated properties of an algorithm, stability and consistency, are equivalent for ERM (and certain loss functions). The generalization bound is given in the article.


Algorithms that are stable

This is a list of algorithms that have been shown to be stable, and the article where the associated generalization bounds are provided.
* Linear regression
* k-NN classifier with a loss function.
* Support Vector Machine (SVM) classification with a bounded kernel and where the regularizer is a norm in a Reproducing Kernel Hilbert Space. A large regularization constant C leads to good stability.
* Soft margin SVM classification.
* Regularized Least Squares regression.
* The minimum relative entropy algorithm for classification.
* A version of bagging regularizers with the number k of regressors increasing with n. (Rifkin, R. Everything Old is New Again: A fresh look at historical approaches in machine learning. Ph.D. Thesis, MIT, 2002.)
* Multi-class SVM classification.
* All learning algorithms with Tikhonov regularization satisfy the Uniform Stability criteria and are, thus, generalizable. (Rosasco, L. and Poggio, T. Stability of Tikhonov Regularization, 2009.)
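
The link between regularization and stability in the list above can be seen in a small experiment. The sketch below is an assumption-laden illustration, not a result from the cited works: it uses one-dimensional regularized least squares on synthetic data, and `lam` multiplies the squared-norm penalty (a Tikhonov-style convention chosen for this sketch, which need not coincide with the SVM constant C mentioned above). The names `fit_ridge` and `max_slope_change` are hypothetical.

```python
import random
from typing import List, Tuple

Example = Tuple[float, float]


def fit_ridge(S: List[Example], lam: float) -> float:
    """Slope w minimising (1/m) * sum((w*x - y)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in S)
    sxx = sum(x * x for x, _ in S)
    return sxy / (sxx + lam * len(S))


def max_slope_change(S: List[Example], lam: float) -> float:
    """Largest |w_S - w_{S^{|i}}| over all left-out indices i."""
    w_full = fit_ridge(S, lam)
    return max(
        abs(w_full - fit_ridge(S[:i] + S[i + 1:], lam)) for i in range(len(S))
    )


rng = random.Random(0)
S = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    S.append((x, 0.5 * x + rng.uniform(-0.2, 0.2)))

# Same training set, increasing penalty weight.
for lam in (0.001, 0.01, 0.1, 1.0, 10.0):
    print(lam, max_slope_change(S, lam))
```

With a heavier penalty the learned slope moves less when one example is removed. The price is a slope pulled further toward zero, so stability alone does not guarantee a small true error.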


References


Further reading

* S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proc. of UAI 18, 2002
* S. Rakhlin, S. Mukherjee, and T. Poggio. Stability results in learning theory. Analysis and Applications, 3(4):397–419, 2005
* V.N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995
* Vapnik, V., Statistical Learning Theory. Wiley, New York, 1998
* Poggio, T., Rifkin, R., Mukherjee, S. and Niyogi, P., "Learning Theory: general conditions for predictivity", Nature, Vol. 428, 419-422, 2004
* Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil, Stability of Randomized Learning Algorithms, Journal of Machine Learning Research 6, 55–79, 2010
* Elisseeff, A. Pontil, M., Leave-one-out Error and Stability of Learning Algorithms with Applications, NATO Science Series Sub Series III Computer and Systems Sciences, 2003, Vol 190, pages 111-130
* Shalev Shwartz, S., Shamir, O., Srebro, N., Sridharan, K., Learnability, Stability and Uniform Convergence, Journal of Machine Learning Research, 11(Oct):2635-2670, 2010

